SLI   Service Level Indicator - a quantitative metric used to measure compliance

SLO   Service Level Objective - a target SLI value or range of values within the SLA
      * SLI ≤ target
      * lower bound ≤ SLI ≤ upper bound

SLA   Service Level Agreement - an agreement between the service provider and the client
      * level of the services to be delivered
      * service measurement metrics
      * penalties if the expected level of services is not met

Be precise with ALL definitions
Amazon SQS and SNS example


Definitions
[Callouts on the slide: "SLI metric" (Availability), "How SLI is measured" (Request)]

* "Availability" is calculated for each 5-minute interval as the percentage of Requests processed by the
  applicable Included Service that do not fail with Errors and relate solely to the provisioned Included
  Service. If you did not make any Requests in a given 5-minute interval, that interval is assumed to be
  100% available.

* An "Error" is any Request that returns a 500 or 503 error code.

* "Monthly Uptime Percentage" for a given AWS region is calculated as the average of the Availability for
  all 5-minute intervals in a monthly billing cycle. Monthly Uptime Percentage measurements exclude
  downtime resulting directly or indirectly from any Messaging SLA Exclusion (defined above).

* A "Request" is, with respect to:
  o SNS: an API request to SNS by directly calling the Publish API or triggered by a supported event source; and
  o SQS: invocation of a SQS Send, Receive, or Delete API.

* A "Service Credit" is a dollar credit, calculated as set forth above, that we may credit back to an eligible account.


Always define common terms like “request”, “failure”, etc.
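The quoted definitions above can be turned into a small calculation. This is a minimal sketch, assuming each 5-minute interval is given as a list of HTTP status codes; the function names and the sample data are illustrative and are not part of the AWS SLA.

```python
from typing import Sequence

ERROR_CODES = {500, 503}  # the quoted definition of an "Error"


def interval_availability(status_codes: Sequence[int]) -> float:
    """Availability (in percent) of one 5-minute interval."""
    if not status_codes:
        # No Requests in the interval: assumed to be 100% available.
        return 100.0
    ok = sum(1 for code in status_codes if code not in ERROR_CODES)
    return 100.0 * ok / len(status_codes)


def monthly_uptime_percentage(intervals: Sequence[Sequence[int]]) -> float:
    """Average of the per-interval Availability values."""
    values = [interval_availability(codes) for codes in intervals]
    return sum(values) / len(values)


# Hypothetical sample: four 5-minute intervals of response status codes.
sample = [[200, 200], [], [200, 500], [404, 503, 200, 200]]
print(monthly_uptime_percentage(sample))
```

Note that a 404 does not count as an Error under this definition, which is exactly why the terms have to be defined precisely.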
DigitalOcean Kubernetes example


Service Commitment 


DigitalOcean Kubernetes (DOKS) provides 99.95% uptime SLA per month for the control plane when high availability (HA) 
is enabled for such clusters. The SLA is effective on the billing period starting July 1, 2022. 


Definitions 
The terms used in this SLA document are defined as follows: 
* Monthly Uptime: For a given DOKS HA Cluster, monthly uptime is calculated by subtracting from 100% the
  percentage of 5-minute intervals during the monthly billing cycle in which the DOKS Cluster control plane
  was Unavailable. If the DOKS cluster exists for only part of the month, availability is calculated over
  the portion of the month during which it existed. Monthly Uptime measurements exclude Unavailability
  resulting directly or indirectly from any SLA exclusion.

* Service Credit: Credit in terms of $USD issued to the associated DigitalOcean account.

* Unavailability: All the requests to the DOKS HA control plane of a cluster fail for more than 5 minutes.


Share “designed for” SLO
Amazon S3 example

Formal SLA is stricter than SLO

Share Control Plane SLA/SLO
DigitalOcean Kubernetes and Amazon SQS/SNS examples


DigitalOcean Kubernetes Control Plane/General ... Availability (GA), now with a 99.95% SLA
Posted: June 23, 2022 · 4 min read
[Callout: Control Plane SLA]

[Callout: Data Plane SLO is stricter]

Service                                            Component        Availability Design Goal
Amazon Simple Notification Service (Amazon SNS)    Data Plane       99.990%
                                                   Control Plane    99.900%
Amazon Simple Queue Service (Amazon SQS)           Data Plane       99.980%
                                                   Control Plane    99.900%


Separate single instance and Multi-zonal SLAs
Google Compute Engine example

Covered Service                  Monthly Uptime Percentage
Instances in Multiple Zones      >= 99.99%
A Single Instance                >= 99.5%
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Ebverstate 


DC = Zone, Town = Region

Vendor 1    Vendor 2    Vendor 3

Standalone DC
Availability Zone
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Availability 
Objectives   Maximizing uptime
Focus on     Components of the system
Time         Mean values over a mid-term period
Scale        Small-scale common disruptions


Example architecture
[Figure: example architecture diagram; legible labels: Order Dispatchers, Replica, 99.5%]

Availability = min(0.9999, 0.9999, 0.9999, 0.995, 0.995, 0.9995) = 99.5%


Availability with Dependencies

A1 = 99%, A2 = 99.5%

Availability_system = A1 * A2
Availability_system = 0.99 * 0.995 = 98.5%

↑ Hard dependencies = ↓ Availability

Hard - workload cannot function without it
Soft - unavailability can go unnoticed or tolerated for some period of time


Availability with full Redundancy

A1 = 99%, A2 = 99%

Availability_system = 1 - (1 - A1) * (1 - A2)
Availability_system = 1 - (1-0.99) * (1-0.99) = 99.99%

↑ Redundancy = ↑ Availability

A1, A2 - Availability of components
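The two composition rules from the last two slides (hard dependencies multiply availabilities, redundant components multiply unavailabilities) can be sketched as follows. The function names are illustrative, and the sketch assumes component failures are independent.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """Hard dependencies: the system is up only if every component is up."""
    return prod(components)


def parallel_availability(components: Iterable[float]) -> float:
    """Full redundancy: the system is down only if every component is down."""
    return 1.0 - prod(1.0 - a for a in components)


print(f"hard dependencies: {serial_availability([0.99, 0.995]):.3%}")
print(f"full redundancy:   {parallel_availability([0.99, 0.99]):.3%}")
```

With the slide values, two hard dependencies at 99% and 99.5% give roughly 98.5%, while two redundant 99% components give roughly 99.99%.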


Availability with partial Redundancy

Availability_system = 1 - n! / ((s+1)! * (n-s-1)!) * (1-A)^(s+1)

A = 99%, n = 5, s = 1:
Availability_system = 1 - 10 * (1-0.99)^(1+1) = 99.90%

A - Availability of components
n - Number of nodes
s - Number of spares (number of node failures the system tolerates)

# of nodes, n    s = 1     s = 2       s = 3         s = 4           s = 5
5                99.90%    99.9990%    99.999995%    99.99999999%
6                99.85%    99.9980%    99.999985%    99.99999994%    99.9999999999%
7                99.79%    99.9965%    99.999965%    99.99999979%    99.9999999993%
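The table can be regenerated with a few lines of code. This is a sketch that reads the slide's formula as 1 - C(n, s+1) * (1-A)^(s+1), where n is the number of nodes, s the number of tolerated failures (spares), and A the availability of one node; the function name is illustrative.

```python
from math import comb


def availability_with_spares(n: int, s: int, a: float) -> float:
    """Approximate availability of n nodes that tolerate s node failures.

    The system is treated as down when any s+1 of the n nodes are down
    at the same time; node failures are assumed independent.
    """
    return 1.0 - comb(n, s + 1) * (1.0 - a) ** (s + 1)


for n in (5, 6, 7):
    row = [f"{availability_with_spares(n, s, 0.99):.10%}" for s in range(1, 5)]
    print(n, *row)
```

The binomial coefficient counts how many distinct groups of s+1 nodes could fail together, which is why availability drops slightly as n grows for a fixed number of spares.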


Availability = 0.9999*0.9999*0.9999*(1-(1-0.995)*(1-0.995))*0.9995 ≈ 99.92%

Availability = 0.9999*0.9999*0.9999 = 99.97%
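Both expressions can be checked by nesting the serial and redundant rules. This is a sketch; the helper names are illustrative, and the grouping of components simply mirrors the parentheses in the expressions on the slide.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """All listed parts are required: multiply availabilities."""
    return prod(availabilities)


def redundant(*availabilities: float) -> float:
    """Any one of the listed parts is enough: multiply unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Three 99.99% components in series.
chain_only = serial(0.9999, 0.9999, 0.9999)

# The same chain, plus a redundant pair of 99.5% components and a 99.95% component.
full_system = serial(0.9999, 0.9999, 0.9999, redundant(0.995, 0.995), 0.9995)

print(f"{chain_only:.4%}")
print(f"{full_system:.4%}")
```

The redundant pair contributes about 99.9975%, so the single 99.95% component becomes the dominant term in the product.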


Time definition
Amazon EC2 example

Availability = Uptime / (Uptime + Downtime)

* "Instance-Level Uptime Percentage" is calculated by subtracting from 100% the percentage of minutes
  during the month in which a Single EC2 Instance was in the state of Unavailability.

* "Unavailable" and "Unavailability" mean:
  o For the Instance-Level SLA, your Single EC2 Instance has no external connectivity.

Application categories
* Batch processing, data extraction, transfer and load jobs
* Knowledge management, project tracking
* 99.95% (22 min.): Online commerce
* 99.99% (4 min.): Video delivery, broadcast streaming
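The downtime figures next to the availability levels follow from simple arithmetic. A minimal sketch, assuming a 30-day month (the slide does not state the period explicitly; the function name is illustrative):

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for a period of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


for target in (99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min per 30-day month")
```

Under this assumption 99.95% allows about 21.6 minutes and 99.99% about 4.3 minutes, which matches the rounded 22 min. and 4 min. on the slide.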


Error Rate definition
Amazon SQS and SNS example

Availability = Successfully Processed Units of Work / Total Valid Units of Work

* "Availability" is calculated for each 5-minute interval as the percentage of Requests processed by the
  applicable Included Service that do not fail with Errors and relate solely to the provisioned Included
  Service. If you did not make any Requests in a given 5-minute interval, that interval is assumed to be
  100% available.
* An "Error" is any Request that returns a 500 or 503 error code.


[Figure: availability over time, with reference lines Goal 99.5%, SLA 99%, and Sev1 95%]

if the system received 1000 requests, only 945 of them were processed successfully


Which definition is better? 


Amazon services example 


Service 
aws Time 
> Error Rate 


O Time 


amazon alexa 


amazon.com Error Rate 
amazonadvertising Time 


prime اناه‎ Time 


Platform 


Error Rate 


Time 


Time 


Error rate is generally 


higher, long Low long peak Constant long peak 
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Availability - Time 


High short 


0.966 0.930 
% % 


Error rate Availability — — - Error rate Availability 24hrs 


External connectivity, downtime during a month

* "Monthly Uptime Percentage" is calculated by subtracting from 100% the percentage of minutes
  during the month in which Amazon EC2 was in the state of Unavailability.
* "Unavailable" and "Unavailability" mean:
  o For the Region-Level SLA applicable to Amazon EC2, when all of your running instances
    deployed in two or more AZs in the same AWS region (or, if there is only one AZ in the AWS
    region, that AZ and an AZ in another AWS region) concurrently have no external connectivity.

At least one Healthy Target

* "Healthy Targets" are the targets of the Load Balancer or GWLB, as applicable, that return a
  Success Code for the health check sent from the Load Balancer or GWLB, as applicable.
* "Monthly Uptime Percentage" is calculated by subtracting from 100% the percentage of minutes
  during the month in which any of the Multi-AZ Load Balancers, as applicable, were in the state of
  Unavailability.
* "Unavailable" and "Unavailability" mean:
  o For the Multi-AZ Load Balancer SLA, your Multi-AZ Load Balancer, which is enabled in two or
    more AZs and has at least one Healthy Target, has no external connectivity and all attempts
    to connect to the Multi-AZ Load Balancer are unsuccessful.

[Figure labels: Web, Order Dispatchers, DB Master]

All connection requests, downtime in a monthly billing cycle

* "Monthly Uptime Percentage" for a given Multi-AZ DB Instance or Multi-AZ DB Cluster is
  calculated by subtracting from 100% the percentage of 1 minute intervals during the monthly
  billing cycle in which the Multi-AZ DB Instance or Multi-AZ DB Cluster was "Unavailable". If you
  have been running that Multi-AZ DB Instance or Multi-AZ DB Cluster for only part of the month,
  your Multi-AZ DB Instance or Multi-AZ DB Cluster is assumed to be 100% available for the
  portion of the month that it was not running.
* "Unavailable" and "Unavailability" mean that all connection requests to the applicable running
  Multi-AZ DB Instance, Multi-AZ DB Cluster, or Single-DB Instance, fail during a 1 minute interval.

Error Rate in a monthly billing cycle, avg. for 5-minute intervals

* "Availability" is calculated for each 5-minute interval as the percentage of Requests processed by
  the applicable Included Service that do not fail with Errors and relate solely to the provisioned
  Included Service. If you did not make any Requests in a given 5-minute interval, that interval is
  assumed to be 100% available.
* An "Error" is any Request that returns a 500 or 503 error code.


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== VIA up comparison 


Vendor 1
  Uptime Percentage = (Maximum Service Uptime - number of minutes of Unavailability) / Maximum Service Uptime

Vendor 2
  Uptime Percentage = 100% - average(Error Rates measured over each 5 min period during the calendar month)

Vendor 3
  Error Rate = number of Valid Requests with an HTTP Status in the 500-range / total number of Valid Requests.
  * Repeated identical requests do not count towards the Error Rate unless they conform to Back-off Requirements.
  Back-off Requirements = when an error occurs, an Application is responsible for waiting for a period of time
  before issuing another request. The minimum back-off interval is 1 sec and for each consecutive error, the
  back-off interval increases exponentially up to 32 seconds.

Vendor 4
  Uptime Percentage = 100% - Average Error Rate
  Average Error Rate = sum(Error Rates for each hour in the billing month) / total number of hours
  Error Rate = total number of Failed Transactions / Total Transactions during one hour
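To see why the choice of formula matters, the time-based and error-rate-based definitions can be applied to the same incident. This is a sketch under stated assumptions: a 30-day month, a hypothetical 60-minute incident, and the assumption that a time-based SLA counts a minute as unavailable only when all requests fail. The function names are illustrative and do not come from any vendor's SLA.

```python
from typing import Sequence


def uptime_time_based(max_minutes: float, unavailable_minutes: float) -> float:
    """Vendor 1 style: share of minutes not spent in Unavailability, in percent."""
    return 100.0 * (max_minutes - unavailable_minutes) / max_minutes


def uptime_error_rate_based(error_rates: Sequence[float]) -> float:
    """Vendor 2 / Vendor 4 style: 100% minus the average per-period error rate."""
    return 100.0 * (1.0 - sum(error_rates) / len(error_rates))


def backoff_seconds(consecutive_errors: int) -> int:
    """Vendor 3 back-off: 1 s, doubling per consecutive error, capped at 32 s."""
    return min(2 ** (consecutive_errors - 1), 32)


MINUTES_PER_MONTH = 30 * 24 * 60          # assumed 30-day month
INTERVALS = MINUTES_PER_MONTH // 5        # number of 5-minute periods

# Hypothetical 60-minute incident (12 periods): total outage vs. 50% of requests failing.
total_outage = [1.0] * 12 + [0.0] * (INTERVALS - 12)
partial_outage = [0.5] * 12 + [0.0] * (INTERVALS - 12)

print(uptime_time_based(MINUTES_PER_MONTH, 60))   # total outage, time-based
print(uptime_error_rate_based(total_outage))      # total outage, error-rate-based
print(uptime_time_based(MINUTES_PER_MONTH, 0))    # partial outage may not count as downtime
print(uptime_error_rate_based(partial_outage))    # partial outage still lowers the metric
print([backoff_seconds(i) for i in range(1, 8)])
```

For a total outage both definitions agree, but a partial degradation only shows up in the error-rate-based number.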
Commit for penalties
Amazon SQS and SNS example

Service Commitment  [Callout: SLA]
AWS will use commercially reasonable efforts to make the Included Services each available with a Monthly
Uptime Percentage for each AWS region, during any monthly billing cycle, of at least 99.9% (the "Service
Commitment"). In the event any of the Included Services do not meet the Service Commitment, you will be
eligible to receive a Service Credit as described below.

Service Credits  [Callout: penalties]
Service Credits are calculated as a percentage of the total charges paid by you for the applicable Included
Service in the applicable AWS region for the monthly billing cycle in which the Monthly Uptime Percentage
fell within the ranges set forth in the table below:

Monthly Uptime Percentage                              Service Credit Percentage
Less than 99.9% but greater than or equal to 99.0%     10%
Less than 99.0% but greater than or equal to 95.0%     25%
Less than 95.0%                                        100%

Penalties: it's just a fraction for failing your commitment

Credit amount is expressed as a percentage of the service's monthly bill that did not meet SLA.
It will be credited to future monthly bills.
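The credit table from the Amazon SQS and SNS example maps directly to a lookup. A minimal sketch: the thresholds are the ones quoted on the slide, while the function names and the sample bill are illustrative.

```python
def service_credit_percentage(monthly_uptime: float) -> int:
    """Service Credit Percentage for a Monthly Uptime Percentage (both in percent)."""
    if monthly_uptime >= 99.9:
        return 0      # Service Commitment met, no credit
    if monthly_uptime >= 99.0:
        return 10
    if monthly_uptime >= 95.0:
        return 25
    return 100


def service_credit(monthly_uptime: float, monthly_charges: float) -> float:
    """Credit amount, as a share of the charges for the affected service and region."""
    return monthly_charges * service_credit_percentage(monthly_uptime) / 100.0


# Hypothetical bill: 200.00 for the affected service, 98% monthly uptime.
print(service_credit(98.0, 200.0))
```

Even at 98% uptime, which is more than 14 hours of downtime in a 30-day month, the credit is only a quarter of one service's monthly bill.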
Vendor 1
Monthly uptime    Service Credits
99.5%-99.0%       10%
99.0%-95%         30%
< 95.0%           100%
> 6 minutes       100%

Vendor 2
Monthly uptime    Service Credits
99.5%-95.0%       10%
95.0%-90.0%       25%
< 90.0%           100%
< 1 minute        not counted

Vendor 3
Monthly uptime    Service Credits
99.9%-99.0%
98.99%-90%
< 90.0%
> 3 minutes       not counted
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definition 


[Figure: timeline: Failure occurs -> Repair starts -> Resume normal operations -> Failure occurs]

Availability = MTBF / (MTBF + MTTR)

* Mean time between failure (MTBF) - an average time between when a
  system begins normal operation and its next failure

* Mean time to repair (MTTR) - a period of time when the system is
  unavailable while the failed component is returned to service

* Mean time to detection (MTTD) - an amount of time between a failure
  occurring and when repair operations begin

Predict and average

Quantitative instead of qualitative

[Figure: timeline: Failure occurs -> Repair starts -> Resume normal operations -> Failure occurs,
with the MTTR and MTBF spans marked]

* MTx metrics are always derived from prediction or forecast
* MTx metrics are averages


Focus on MTTR

5-second rule :)

Availability = MTBF / (MTBF + MTTR)

[Figure: two plots, Availability vs. MTTR and Availability vs. MTBF]

But... a resource that often goes down (MTBF) and recovers
quickly (MTTR) might trigger expensive failure handling
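The effect of MTTR on the formula can be checked numerically. A sketch with hypothetical numbers (1000 hours between failures, 1 hour to repair); the function name is illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(1000, 1)         # hypothetical starting point
faster_repair = availability(1000, 0.5)  # MTTR halved
rarer_failure = availability(2000, 1)    # MTBF doubled

print(f"baseline:      {baseline:.5%}")
print(f"faster repair: {faster_repair:.5%}")
print(f"rarer failure: {rarer_failure:.5%}")
```

Halving MTTR yields the same availability as doubling MTBF, since only the ratio of the two enters the formula.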
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A Must read 


Neither of these definitions is satisfactory because the design and deployment of cloud systems,
such as Azure [14], Dynamo [16], or Gmail, actively avoid single points of failure by sharding
data across many machines and using replication [23] with failover when problems occur.
Consequently, these systems rarely have an out[...]


Meaningful Availability 


Tamás Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean, Amer Diwan
Google

Abstract

High availability is a critical requirement for cloud applications: if a system does not have high
availability, users cannot count on it for their critical work. Having a metric that meaningfully
captures availability is useful for both users and system developers. It informs users what they
should expect of the availability of the application. It informs developers what they should focus
on to improve user-experienced availability. This paper presents and evaluates, in the context of
Google's G Suite, a novel availability metric: windowed user-uptime. This metric has two main
components. First, it directly models user-perceived availability and avoids the bias in commonly
used availability metrics. Second, by simultaneously calculating the availability metric over many
windows it can readily distinguish between many short periods of unavailability and fewer but
longer periods of unavailability.

1 Introduction

Users rely on productivity suites and tools, such as G Suite, Office 365, or Slack, to get their
work done. Lack of availability in these suites comes at a cost: lost productivity, lost revenue
and negative press for both the service provider and the user [1, 3, 6]. System developers and
maintainers use metrics to quantify service reliability [10, 11]. A good availability metric should
be meaningful, proportional, and actionable. By “meaningful” we mean that it should capture what
users experience. By “proportional” we mean that a change in the metric should be proportional to
the change in user-perceived availability. By “actionable” we mean that the metric should give
system owners insight into why availability for a period was low. This paper shows that none of
the commonly used metrics satisfy these requirements and presents a new metric, windowed
user-uptime that meets these requirements. We evaluate the metric in the context of Google's
G Suite products, such as Google Drive and Gmail.

The two most commonly used approaches for quantifying availability are success-ratio and
incident-ratio. Success-ratio is the fraction of the number of successful requests to total
requests over a period of time (usually a month or a quarter) [5, 2, 9]. This metric has some
important shortcomings. First, it is biased towards the most active users; G Suite's most active
users are 1000x more active than its least active (yet still active) users. Second, it assumes
that user behavior does not change during an outage, although it often does: e.g., a user may give
up and stop submitting requests during an outage which can make the measured impact appear smaller
than it actually is. Incident-ratio is the ratio of “up minutes” to “total minutes”, and it
determines “up minutes” based on the duration of known incidents. This metric is inappropriate for
large-scale distributed systems since they are almost never completely down or up.

Our approach, windowed user-uptime has two components. First, user-uptime analyzes fine-grained
user request logs to determine the up and down minutes for each user and aggregates these into an
overall metric. By considering the failures that our users experience and weighing each user
equally, this metric is meaningful and proportional. Second, windowed user-uptime simultaneously
quantifies availability at all time windows, which enables us to distinguish many short outages
from fewer longer ones; thus it enables our metric to be actionable.

We evaluate windowed user-uptime in the context of Google's G Suite applications and compare it to
success-ratio. We show, using data from a production deployment of G Suite, that the
above-mentioned bias is a real shortcoming of success-ratio and that windowing is an invaluable
tool for identifying brief, but significant outages. Our teams systematically track down the root
cause of these brief outages and address them before they trigger larger incidents.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3
motivates the need for windowed user-uptime. Section 4 describes user-uptime. Section 5 extends
user-uptime with windowed user-uptime. Section 6 evaluates our approach and Section 8 concludes
the paper.


Goodhart's law: every measure that becomes a target becomes a bad measure

Amazon S3 according to the status page.


* Accept failure with post-mortem

aws.amazon.com/message/41926/

Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region

We'd like to give you some additional information about the service disruption that occurred in the Northern
Virginia (US-EAST-1) Region on the morning of February 28th, 2017. The Amazon Simple Storage Service (S3)
team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM
PST, an authorized S3 team member using an established playbook executed a command which was intended
to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process.
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was
removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.

Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our
long track record of availability with Amazon S3, we know how critical this service is to our customers, their
applications and end users, and their businesses. We will do everything we can to learn from this event and
use it to improve our availability even further.


Durability
Objectives   Minimizing data loss and corruption
Focus on     Data the system works with
Time         Long-term likelihood of data loss or corruption
Scale        Small-scale corruption and large-scale loss
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Durability and AFR 


Durability - likelihood of avoiding data loss or corruption
Data health = the data you retrieve is the same as the data you stored

Durability_system = 1 - AFR_system

Annualized Failure Rate (AFR) - probability of system failure during a year

Durability is the complement of AFR, but in many cases Durability is detached
from AFR and calculated by itself. That is why it's defined in a broader
scope than the annual rate.


Durability and AFR
AWS EBS and S3 example

EBS gp3 volume
  AFR = 0.2%
  Durability = 99.8%
  Explanation: avg. loss of 0.2% of volumes over a given year;
  if you have 1,000 volumes running for one year, you should expect about 2 volumes to fail

S3 bucket
  AFR = 0.000000001%
  Durability = 99.999999999% (11 x 9s)
  Explanation: if you store 10,000 objects, on avg. you may lose one of them every 10,000,000 years;
  if you store 1,000,000,000 objects, on avg. you may lose one of them every 100 years
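The explanations above are expected-value arithmetic on the AFR. A minimal sketch; the function names are illustrative, and the AFR values are the ones quoted on the slide, expressed as fractions.

```python
def expected_annual_losses(items: float, afr: float) -> float:
    """Expected number of items lost per year for a given annualized failure rate."""
    return items * afr


def years_per_loss(items: float, afr: float) -> float:
    """Average number of years between single-item losses."""
    return 1.0 / (items * afr)


EBS_GP3_AFR = 0.002   # 0.2%
S3_AFR = 1e-11        # 0.000000001%, i.e. 11 nines of durability

print(expected_annual_losses(1_000, EBS_GP3_AFR))   # volumes lost per year
print(years_per_loss(10_000, S3_AFR))               # years per lost object
print(years_per_loss(1_000_000_000, S3_AFR))        # years per lost object
```

The same durability figure therefore reads very differently depending on how many objects are stored.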
For AFR, [vendor] uses industry-accepted guidelines and assumes a conservative
drive AFR of 5%. In practice, our observed AFRs are much lower. For MTTR, [vendor]
estimates 3.4 days per drive, based on the calculation below:

MTTR = 14 TB drive capacity / 50 MB/s drive write speed = 3.4 days to fully write a replaced 14 TB drive

As stated earlier in the erasure coding discussion, in order for an object to be lost,
more than 4 drives in a storage slice would have to fail. To understand the
probability of this, the first step is to understand the probability of a single drive
failing using the calculation below:

Probability of 1 drive failing = AFR (5%/year) * MTTR (3.4 days) * (1/365 years/day) = 4.66 x 10^-4

The next step would be to understand the probability of four drives failing while
another drive in the storage slice was rebuilding. This would be a potential data loss
scenario because five drives in a storage slice would not be available (one in rebuild
mode, plus four new failures). To calculate this probability, this formula applies:

Probability of 4 drives failing = Probability of 1 drive failing (4.66 x 10^-4) to the 4th power = 4.7 x 10^-14

The final step in calculating data durability is to factor in the probability of 4 drives
failing using the following formula:

Data Durability = 1 - (probability of 4 drives failing) = 1 - 4.7 x 10^-14 = 0.99999999999995

As seen in the above calculations, [vendor] storage architecture provides greater than
11 x 9s of durability. The calculated number is 13 x 9s but for the sake of
taking a conservative approach to the calculations and to align with how most of the
hyperscalers position their data durability, [vendor] uses 11 x 9s as the published data
durability metric.
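The vendor's arithmetic can be reproduced in a few lines. This sketch takes the quoted inputs (5% AFR, 3.4 days MTTR, 4 additional failures) as given and follows the vendor's simplification of raising the single-drive probability to the 4th power; the function and constant names are illustrative.

```python
AFR = 0.05        # assumed conservative annual failure rate per drive (quoted)
MTTR_DAYS = 3.4   # quoted rebuild time for a replaced drive
FAILURES = 4      # additional drive failures that lead to data loss (quoted)


def p_drive_fails_during_rebuild(afr: float, mttr_days: float) -> float:
    """Probability that one drive fails within a single rebuild window."""
    return afr * mttr_days / 365.0


p_single = p_drive_fails_during_rebuild(AFR, MTTR_DAYS)
p_loss = p_single ** FAILURES   # independent failures within the same window
durability = 1.0 - p_loss

print(f"{p_single:.3e}")
print(f"{p_loss:.3e}")
print(f"{durability:.14f}")
```

The result has 13 leading nines, which is why the published 11 x 9s figure is described as conservative.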
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ercovered calculation is adjusted to a common value 
Seagate Lyve Cloud S3 example

[...] data durability. When further combined with advanced erasure code, where storage nodes are distributed
across multiple racks and systems, Lyve Cloud takes data durability of public cloud S3-compatible storage to
a new level, nearly 29 nines (0.99999999999999999999999999999). We state an 11 nines (0.99999999999)
durability to match the industry standard offered by other providers.


Poisson or Markov

With Poisson distribution or Markov chain

[Figure: Markov chain of disk states: 0 failures (n disks remaining) -> 1 failure (n-1 disks remaining)
-> ... -> m failures (n-m disks remaining) -> data loss, with failure transition rates
nλ, (n-1)λ, ..., (n-m+1)λ, (n-m)λ]

R(n, m) = n! / (n-m-1)! * λ^(m+1) / μ^m

Poisson
AFR                              0.8%
MTTR, hrs                        48
# of disks, n                    20
# of failures to lose, k         4
Failure rate, λ                  0.000876712
Interval durability              9.9999999999998E-01
Annual durability                0.999999999996

Markov
AFR                              0.8%
MTTR, hrs                        48
# of disks, n                    20
# of failures before loss, m     3
MTBF, hrs                        1091361
Failure rate, λ                  9.16287E-07
Recovery rate, μ                 0.0208333
Data loss rate per hr            1.51E-15
Annual durability                0.99999999999 (11x)
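The Poisson column can be reproduced as follows. This is a sketch of one way to arrive at the slide's numbers, assuming λ is the expected number of disk failures among n disks during one MTTR window, that data is lost when exactly k failures fall into the same window, and that a year consists of 8760 / MTTR independent windows. The function names are illustrative.

```python
from math import exp, expm1, factorial, log1p

HOURS_PER_YEAR = 8760


def failures_per_window(afr: float, mttr_hours: float, n_disks: int) -> float:
    """Expected number of disk failures among n disks during one MTTR window."""
    return n_disks * afr * mttr_hours / HOURS_PER_YEAR


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return lam ** k * exp(-lam) / factorial(k)


def annual_loss_probability(afr: float, mttr_hours: float, n_disks: int, k: int) -> float:
    """Probability of at least one data-loss window during a year."""
    p_window = poisson_pmf(k, failures_per_window(afr, mttr_hours, n_disks))
    windows_per_year = HOURS_PER_YEAR / mttr_hours
    # 1 - (1 - p)^N, computed via log1p/expm1 to keep precision for tiny p.
    return -expm1(windows_per_year * log1p(-p_window))


lam = failures_per_window(0.008, 48, 20)
loss = annual_loss_probability(0.008, 48, 20, 4)
print(f"failure rate per window: {lam:.9f}")
print(f"annual loss probability: {loss:.3e}")
```

Working with the loss probability instead of the durability itself avoids losing digits when subtracting a value around 1e-12 from 1.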


Zero failure rate
Azure managed disks example

Managed disks are designed for 99.999% availability. Managed disks achieve this by
providing you with three replicas of your data, allowing for high durability. If one or
even two replicas experience issues, the remaining replicas help ensure persistence
of your data and high tolerance against failures. This architecture has helped Azure
consistently deliver enterprise-grade durability for infrastructure as a service (IaaS)
disks, with an industry-leading ZERO% annualized failure rate.

Do not trust ZERO%
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storage example‏ د 


Microsoft Azure


Building reliable systems on Azure is a shared responsibility. Microsoft 
is responsible for the reliability of the cloud platform, including our global 
network and datacenters. Our customers and partners are responsible for 

the reliability of their cloud applications, using architectural best practices 

based on the requirements of each workload. 


No matter what your service-level objectives are, Azure can help you 
achieve your organization's reliability goals. Design and operate 
mission-critical systems with confidence by taking advantage of built-in 
features for high availability, disaster recovery, and backup. 


Maintain acceptable continuous performance 
despite temporary failure in services, hardware, or 
datacenters—as well as fluctuation in load—using 
Azure Availability Zones and availability sets. 


Protect against the loss of an entire region 
through asynchronous replication for failover of 
virtual machines and data using services like 
geo-redundant storage and Azure Site Recovery. 


Single VM 


Improve the availability of 
single-instance VMs by using 
premium/ultra disks to qualify 

for an availability SLA. 


99.9% SLA (3 9s) 
VM availability (monthly) 


with premium/ultra disks 


99.999999999% (11 9s) 
Storage durability (annually) 


» Virtual machine | Compute 
& Storage account | 


Optional: Azure Backup 


Local redundancies 


Protect against failures with redundancy within 
a single datacenter in the event of hardware 
malfunctions or software update cycles. 


99.95% SLA (3½ 9s)
VM availability (monthly) 


within a datacenter 


Managed Disk in Availability Set 


99.999999999% (11 9s) 
Storage durability (annually) 


Zonal redundancies 


Protect against datacenter failures through 
redundancy within a single region in the 
event of power, cooling, or networking issues. 


99.99% SLA (4 9s) 
VM availability (monthly) 


within a region 


Zone 1 - 


99.9999999999% (12 9s)
Storage durability (annually) 


Regional redundancies 


Protect against entire-region failures with redundancy 
beyond a single region in the event of a tornado, 


earthquake, or other large-scale disaster. 


RPO and RTO 


Primary region 


Failover 
Azure Site Recovery 


Failback 


Storage replication 


Failback

99.99999999999999% (16 9s)
Storage durability (annually)


Secondary region 


Apples to apples -> one-to-one comparison

Amazon S3 and EBS example


99999999999 99999999999 
% 


Designed for Y 99.999999999% y, 


Designed for 


plane Availability 99.999% 99.999% 99.999% 


9979357 997937 99.95% 


plane Availability 
Instance level 


Availability SLA 99.5% 99.5% 99.5% 


Same Durability but different Availability

Same Availability but different Durability
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